Adaptive Learning of Uncontrolled Restless Bandits with Logarithmic Regret



Cem Tekin, Mingyan Liu 






Abstract — In this paper we consider the problem of learning 
the optimal policy for the uncontrolled restless bandit problem. 
In this problem only the state of the selected arm can be observed, 
the state transitions are independent of control and the transition 
law is unknown. We propose a learning algorithm which gives 
logarithmic regret uniformly over time with respect to the optimal 
finite horizon policy with known transition law under some 
assumptions on the transition probabilities of the arms and the 
structure of the optimal stationary policy for the infinite horizon 
average reward problem. 



I. Introduction 

In an uncontrolled restless bandit problem (URBP) there 
is a set of arms indexed by 1, 2, . . . , m whose state process 
is discrete and follows a Markov rule independent of each 
other. The user chooses one arm at each step, gets the reward 
and observes the current state of that arm. The control action, 
i.e., the arm selection, does not affect the state transitions. 
However it is both used to exploit the instantaneous reward 
and to decrease the uncertainty about the current state of the 
system by exploring. Thus the optimal policy should balance 
the tradeoff between exploration and exploitation. 
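
The setting described above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the authors' code: the function name `simulate_urbp`, the `policy` callback, and the use of state indices as rewards are all hypothetical choices made for the example. The point it shows is that every arm makes a transition at every step regardless of the action, while only the selected arm is observed.

```python
import numpy as np

def simulate_urbp(P_list, policy, horizon, seed=0):
    """Simulate an uncontrolled restless bandit (illustrative sketch).

    P_list : list of row-stochastic matrices, one per arm (assumed inputs).
    policy : callable(t, last_obs) -> arm index, where last_obs maps an arm
             to the pair (last observed state, time of that observation).
    Rewards are taken to be the state indices themselves (an assumption).
    """
    rng = np.random.default_rng(seed)
    states = [0] * len(P_list)      # true states, hidden from the user
    last_obs = {}
    total = 0.0
    for t in range(horizon):
        u = policy(t, last_obs)
        # Every arm moves, whether or not it was selected.
        states = [int(rng.choice(len(P), p=P[s])) for P, s in zip(P_list, states)]
        # Only the selected arm is observed and pays a reward.
        last_obs[u] = (states[u], t)
        total += states[u]
    return total, last_obs
```

A policy that always plays arm 0 never learns anything about the other arms, which is the exploration cost that the learning algorithm has to control.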

If the structure of the system, i.e., the state transition 
probabilities and the rewards of the arms are known, then 
the optimal policy can be found by dynamic programming 
for any finite horizon problem. In the case of infinite horizon, 
stationary optimal policies can be found for the discounted 
problem by using the contraction properties of the dynamic 
programming operator. For the infinite horizon average reward 
problem, stationary optimal policies can be found under some 
assumptions on the transition probabilities [1], [2]. However,
knowing the structure of the system before using it is a
strong assumption. In most systems, the user does not
have a perfect model of the system at the beginning but learns
the model over time. Therefore, we assume that initially the
user does not know the transition probabilities of the arms and
learns them over time based on its observations. Thus, our
goal is to design learning algorithms with the fastest convergence
rate, i.e., minimum regret, where the regret of a learning policy
at time t is defined as the difference between the reward of the
optimal policy for the undiscounted t-horizon problem with
full information about the system model and the undiscounted
reward of the learning policy up to time t.

In this paper we show that under some assumptions on the 
transition probabilities of the arms and the structure of the 
optimal policy for the infinite horizon average reward problem,
algorithms with logarithmic regret uniformly in time with 
respect to the optimal policy for the finite time undiscounted 
problem with known transition probabilities exist. We also 
claim that logarithmic order is the best achievable order for 
URBP. To the best of our knowledge this paper is the first 
attempt to extend the optimal adaptive learning to partially 
observable Markov decision processes (POMDP). 

Related work in optimal adaptive learning started with
the paper of Lai and Robbins [3], where asymptotically
optimal adaptive policies for the multi-armed bandit problem
with an i.i.d. reward process for each arm were constructed.
These are index policies, and it is shown that they achieve
the optimal regret both in terms of the constant and the
order. Later Agrawal [4] considered the i.i.d. problem and
provided sample mean based index policies which are easier
to compute and order optimal, but not optimal in terms of the
constant in general. Anantharam et al. [5], [6] proposed
asymptotically optimal policies with multiple plays at each
time for i.i.d. and Markovian arms respectively. However, all
the above work assumed parametrized distributions for the
reward process of the arms. Auer et al. [7] considered the i.i.d.
multi-armed bandit problem and proposed sample mean based
index policies with logarithmic regret when reward processes
have a bounded support. Their upper bound holds uniformly
over time rather than asymptotically, but it is not
asymptotically optimal. Following this approach Tekin and Liu
[8], [9] provided policies with uniformly logarithmic regret
bounds with respect to the best single arm policy for restless
and rested multi-armed bandit problems and extended the
results to multiple plays [10]. Decentralized multi-player
versions of the i.i.d. multi-armed bandit problem under different
collision models were considered in [11], [12], [13]. Other
research on adaptive learning focused on Markov Decision
Processes (MDP) with finite state and action space. Burnetas
and Katehakis [14] proposed index policies with asymptotic
logarithmic regret, where the indices are the inflations of the
right-hand side of the estimated average reward optimality equations
based on Kullback-Leibler (KL) divergence, and showed that
these are asymptotically optimal both in terms of the order and
the constant. However, they assumed that the support of the
transition probabilities is known. Tewari and Bartlett [15]
proposed a learning algorithm that uses $\ell_1$ distance instead
of KL divergence with the same order of regret but a larger
constant. Their proof is simpler than the proof in [14] and
does not require the support of the transition probabilities
to be known. Auer and Ortner proposed another algorithm
with logarithmic regret and reduced computation for the MDP
problem, which solves the average reward optimality equations
only when a confidence interval is halved. In all the above
work the MDPs are assumed to be irreducible.

The organization of the remainder of this paper is as follows.
In Section II we give the problem formulation, notation and
some lemmas that will be used throughout the proofs in the
paper. In Section III we give sufficient conditions under which
the average reward optimality equation has a solution. In
Section IV we give an equivalent countable representation of
the information state and an assumption under which the regret
of a policy can be related to the expected number of times a
suboptimal action is taken. Then, we give an upper bound for
the regret of an admissible policy in Section V. In Section VI
an adaptive learning algorithm is given, and in Section VII an
upper bound for the regret of the adaptive learning algorithm
is derived. Section VIII concludes the paper.

Due to page limitations, some proofs are not given. They
will be included in the full version of the paper.

II. Problem Formulation, Preliminaries and Notation

$N = \{1, 2, \ldots\}$ is the set of natural numbers, $Z_+ = \{0, 1, \ldots\}$ is the set of non-negative integers, $\langle \cdot \bullet \cdot \rangle$ represents the standard inner product, and $\|\cdot\|_1$ represents the $\ell_1$ norm for vectors and the induced maximum row sum norm for matrices. For a vector or group of matrices $v$, $(v_{-u}, v')$ represents the vector or group of matrices whose $u$th element is $v'$, while all other elements are the same as the elements of $v$. $e_x$ is the unit vector whose $x$th component is one while all other components are zero, and whose dimension will be clear from the context. Unit vectors with dimension $|S^k|$ are represented by $e^k_x$. Let $\beta = \sum_{t=1}^{\infty} 1/t^2$.

Assume that there are $m$ arms, indexed by the set $M = \{1, 2, \ldots, m\}$. We assume that all arms are independent, irreducible, aperiodic Markov chains in discrete time $t = 0, 1, \ldots$. Let $x^k \in S^k$ denote a state of arm $k$, where $S^k$ is the state space of arm $k$. For simplicity $x^k$ also represents the reward from state $x^k$ of arm $k$, and we assume that $S^k \cap S^l = \emptyset$ for $k \neq l$ without loss of generality. Then the state space of the system is $S = S^1 \times \ldots \times S^m$ and $x = (x^1, \ldots, x^m) \in S$ is a state of the system. Let $r_{\max} = \max_{x^k \in S^k, k \in M} x^k$. Let $X^k_t$ and $X_t = (X^1_t, \ldots, X^m_t)$ be the random variables representing the state of arm $k$ at time $t$ and the state of the system at time $t$ respectively. $P^k$ is the transition probability matrix of arm $k$, $p^k_{x^k x'^k} = (P^k)_{x^k x'^k} = P(X^k_{t+1} = x'^k | X^k_t = x^k)$, and $P = (P^1, \ldots, P^m)$ is the set of transition probability matrices.

There is a user who selects one of the $m$ arms at any $t$ and gets the reward from that arm depending on the state of that arm, with the goal of maximizing the undiscounted sum of the rewards for any finite horizon. The user does not know $P$, thus he needs to balance exploration and exploitation in order to maximize his reward. Moreover, the user can only observe the state of the arm he chooses and cannot observe the state of the other arms. Thus, the user should learn the uncontrolled POMDP problem. The action and observation spaces of the user at any time $t$ are $U = \{1, \ldots, m\}$ and $Y = \cup_{k=1}^{m} S^k$ respectively. Then $u_t \in U$, $y_t \in Y$ are the action and the observation of the user at time $t$ respectively, and $U_t$, $Y_t$ are the random variables representing the action and the observation at time $t$ respectively. The history at time $t$ is $z^t = (u_0, y_1, u_1, y_2, \ldots, u_{t-1}, y_t)$.

Let $Q_P(y|u)$ be the substochastic transition probability matrix such that $(Q_P(y|u))_{xx'} = P_P(X_{t+1} = x', Y_{t+1} = y | X_t = x, U_t = u)$. For URBP $Q_P(y|u)$ is the zero matrix for $y \notin S^u$. For $y \in S^u$ the only nonzero entries of $Q_P(y|u)$ are the ones for which $x'^u = y$.

Finally we give useful definitions and lemmas. 

Lemma 1: For $p_k, p'_k \in [0, 1]$ we have
(1)



The norm used in the equations below is the total variation
norm. For finite and countable vectors this corresponds to the $\ell_1$
norm, and the induced matrix norm corresponds to the maximum
absolute row sum norm.

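Definition 1: [16] A Markov chain $X = \{X_t, t \in Z_+\}$ on a measurable space $(S, \mathcal{B})$, with transition kernel $P(x, G)$, is uniformly ergodic if there exist constants $\rho < 1$, $C < \infty$ such that for all $x \in S$,

$\|e_x P^t - \pi\| \le C \rho^t, \quad t \in Z_+$,   (2)

where $\pi$ is the stationary distribution of the chain.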
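
As a numerical illustration of the geometric decay in Definition 1, the sketch below computes the worst-case $\ell_1$ distance to the stationary distribution for a small chain. The matrix `P` and the helper names are hypothetical and chosen only for the example; for a finite chain the total variation norm is taken here to be the $\ell_1$ norm, as stated above.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def worst_case_distance(P, t):
    """Maximum over start states x of the l1 distance || e_x P^t - pi ||_1."""
    pi = stationary_distribution(P)
    Pt = np.linalg.matrix_power(P, t)   # row x of P^t equals e_x P^t
    return float(np.abs(Pt - pi).sum(axis=1).max())

# Hypothetical two-state arm with all transition probabilities positive.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
dists = [worst_case_distance(P, t) for t in range(1, 6)]
```

For this particular matrix the distances shrink by the second eigenvalue of `P` at each step, so a bound of the form $C\rho^t$ holds with $\rho$ equal to that eigenvalue.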



Lemma 2: ([16], Theorem 3.1) Let $X = \{X_t, t \in Z_+\}$ be a uniformly ergodic Markov chain for which (2) holds. Let $\tilde{X} = \{\tilde{X}_t, t \in Z_+\}$ be the perturbed chain with transition kernel $\tilde{P}$. Given that the two chains have the same initial distribution, let $\psi_t$, $\tilde{\psi}_t$ be the distributions of $X$, $\tilde{X}$ at time $t$ respectively. Then,

$\|\psi_t - \tilde{\psi}_t\| \le$
(3)

where $\hat{t} = \lceil \log_\rho C^{-1} \rceil$.



III. Solutions of the Average Reward Optimality 
Equation (AROE) 

Assume that the transition probability matrices for the arms
are known by the user. Then, URBP turns into an optimization
problem (POMDP) rather than a learning problem. In its
general form this problem is intractable [17], but heuristics,
approximations and exact solutions under different assumptions
on the arms are studied in [18], [19], [20] and many
others.

One way to represent a POMDP problem is to use the belief space (information state space), i.e., the set of probability distributions over the state space. For URBP with the set of transition probability matrices $P$ the belief space is $\Psi = \{\psi : \psi^T \in R^{|S|}, \psi_x \ge 0, \forall x \in S, \sum_{x \in S} \psi_x = 1\}$, which is the unit simplex in $R^{|S|}$. Let $\psi_0$ denote the initial belief and $\psi_t$ denote the belief at time $t$. $V_P(\psi, y, u) = \psi Q_P(y|u) \mathbf{1}$ is the probability that $y$ will be observed given that the belief is $\psi$ and action $u$ is taken, and $T_P(\psi, y, u) = \psi Q_P(y|u) / V_P(\psi, y, u)$ is the next belief given that action $u$ is taken at belief $\psi$ and $y$ is observed, where $\mathbf{1}$ is the $|S|$ dimensional column vector of 1's. Let $\Gamma$ be the set of admissible policies, i.e., any policy for which the action at time $t$ is a function of $\psi_0$ and $z^t$. The AROE is

$g + h(\psi) = \max_{u \in U} \left\{ r(\psi, u) + \sum_{y \in S^u} V_P(\psi, y, u) h(T_P(\psi, y, u)) \right\}$,   (4)

where $r(\psi, u) = \langle \psi \bullet r(u) \rangle = \sum_{x \in S} \psi_x r(x, u)$ is the expected reward of action $u$ at belief $\psi$, $r(u) = (r(x, u))_{x \in S}$ and $r(x, u) = x^u$ is the reward when arm $u$ is chosen at state $x$. We have the following assumption.

Assumption 1: $p^k_{ij} > 0$, $\forall k \in M$, $i, j \in S^k$.

Under this assumption the existence of a bounded, convex, continuous solution to the AROE is guaranteed.
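
The observation probability $V_P(\psi, y, u) = \psi Q_P(y|u)\mathbf{1}$ and the belief update $T_P(\psi, y, u) = \psi Q_P(y|u)/V_P(\psi, y, u)$ used in the AROE can be sketched as follows. This is a minimal sketch under assumptions that are not in the paper: the joint chain is given as one matrix `P_joint`, joint states are integer indices, and `arm_state_of(x, u)` is a hypothetical helper returning the state of arm `u` inside joint state `x`.

```python
import numpy as np

def observation_matrix(P_joint, arm_state_of, y, u):
    """Q_P(y|u): keep only transitions into joint states whose arm-u state is y."""
    n = P_joint.shape[0]
    mask = np.array([arm_state_of(xp, u) == y for xp in range(n)], dtype=float)
    return P_joint * mask               # broadcasting zeroes the excluded columns

def belief_update(psi, Q):
    """Return (V, T) with V = psi Q 1 and T = psi Q / V."""
    unnormalized = psi @ Q
    V = float(unnormalized.sum())
    if V == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return V, unnormalized / V

# Hypothetical example: two independent two-state arms, joint index x = 2*x1 + x2.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.5, 0.5], [0.5, 0.5]])
P_joint = np.kron(P1, P2)
arm_state = lambda x, u: x // 2 if u == 0 else x % 2
psi = np.full(4, 0.25)
V, T = belief_update(psi, observation_matrix(P_joint, arm_state, 1, 0))
```

Under Assumption 1 every entry of each arm's transition matrix is positive, so `V` is never zero for an observation $y \in S^u$ and the division is always defined.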



Lemma 3: Let 



h - inf^g*(/i(V')), h^ 
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w 




4',l 
Under Assumption 1 the following holds:

(i) [1] There exists a finite constant $g_P$ and a bounded convex continuous function $h_P : \Psi \to R$ which is a solution to (4).

(ii) $h_{P-}(\psi) \le h_{T,P}(\psi) - T g_P \le h_{P+}(\psi)$, $\forall \psi \in \Psi$.

(iii) $h_{T,P}(\psi) = T g_P + h_P(\psi) + O(1)$ as $T \to \infty$.

IV. Countable Representation of the Information State

We can represent the information state at time $t$ as $(s^t, \tau^t) = ((s^t_1, \ldots, s^t_m), (\tau^t_1, \ldots, \tau^t_m))$, where $s^t_k$ and $\tau^t_k$ are the last observed state from arm $k$ and the time from the last observation of arm $k$ to $t$ respectively. This representation requires that all arms are sampled at least once, thus we assume that initially the user samples from all the arms once even before the adaptive learning begins. The contribution of this to the regret is at most $m r_{\max}$. Thus, we assume that the user always starts with some initial belief $(s^0, \tau^0)$. This representation of the information state will correspond to different points in $\Psi$ under different sets of transition probability matrices. Thus with an abuse of notation we use $\psi_P((s^t, \tau^t)) = \psi_t \in \Psi$ to represent an element of the countable representation of the information state under $P$ at time $t$. Let $\Psi_C(P)$ be the set of points in $\Psi$ corresponding to the countable representation of the information state under $P$. Let $O(\psi; P)$, $O((s, \tau); P)$ denote the set of optimal actions at belief $\psi$, $\psi_P((s, \tau))$ respectively.
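
Because the arms are independent and uncontrolled, the belief that corresponds to a pair $(s, \tau)$ factors across arms: the distribution of arm $k$ is row $s_k$ of the $\tau_k$-step transition matrix. The sketch below illustrates this mapping; the function names are hypothetical, and states are represented by integer indices rather than by reward values.

```python
import numpy as np

def arm_belief(P, s, tau):
    """Distribution of an arm's current state, last seen in state s tau steps ago."""
    return np.linalg.matrix_power(P, tau)[s]

def joint_belief(P_list, s, tau):
    """Product-form belief over the joint state space for independent arms."""
    psi = np.array([1.0])
    for P, s_k, tau_k in zip(P_list, s, tau):
        psi = np.kron(psi, arm_belief(P, s_k, tau_k))
    return psi

# Hypothetical example: the same two-state chain for both arms.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
psi = joint_belief([P, P], s=(0, 1), tau=(1, 2))
```

The same pair $(s, \tau)$ maps to a different point of the simplex when a different set of matrices is used, which is why the estimated belief under $\hat{P}^t$ generally differs from the true belief under $P$.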

Since the user does not know $P$, at time $t$ he has an estimate $\hat{P}^t = (\hat{P}^t_1, \ldots, \hat{P}^t_m)$ based on his past observations and actions. Then the estimated belief according to $\hat{P}^t$ is $\hat{\psi}_t = \mathcal{F}_{\hat{P}^t}(s^0, \tau^0)\, \mathcal{G}(u_0, y_1, \ldots, u_{t-1}, y_t) = \psi_{\hat{P}^t}((s^t, \tau^t))$ for appropriate functions $\mathcal{F}$ and $\mathcal{G}$. Even when the user knows the optimal policy for the infinite horizon average reward problem, he may not be able to play optimally because he does not know the exact belief $\psi_t$ at time $t$. In this case, in order for the user to play optimally, there should be an $\epsilon > 0$ such that if $\|\psi_t - \hat{\psi}_t\|_1 < \epsilon$, the set of actions that are optimal in $\hat{\psi}_t$ is a subset of the set of actions that are optimal in $\psi_t$. We will state an assumption under which this property holds. We claim that for arbitrary $P$ and $S$ this assumption will generally be satisfied, but a characterization of conditions on $P$ and $S$ for this property to hold is an open problem for the URBP.

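Let $\tau_0$ denote a mixing time. Based on $\tau_0$ let $\mathcal{G}(\tau_0, P)$ be the finite partition of $\Psi_C(P)$ into sets $G_{i_1, \ldots, i_m}$ such that $i_k = \tau_0$ or $i_k = (s_k, \tau_k)$, $\tau_k < \tau_0$, $s_k \in S^k$. Let $s'(G_{i_1, \ldots, i_m}) = \{s_k : i_k \neq \tau_0\}$, $\tau'(G_{i_1, \ldots, i_m}) = \{\tau_k : i_k \neq \tau_0\}$ and

$M(G_{i_1, \ldots, i_m}) = \{k : i_k = \tau_0\}$, $\bar{M}(G_{i_1, \ldots, i_m}) = M - M(G_{i_1, \ldots, i_m})$.

Then,

$G_{i_1, \ldots, i_m} = \{(s, \tau) \in \Psi_C(P) : (s_{\bar{M}(G_{i_1, \ldots, i_m})}, \tau_{\bar{M}(G_{i_1, \ldots, i_m})}) = (s', \tau'), s_k \in S^k, \tau_k \ge \tau_0, \forall k \in M(G_{i_1, \ldots, i_m})\}$.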



We have the following assumption.

Assumption 2: There exists $\tau_0 \in N$ such that: (i) Every $G \in \mathcal{G}(\tau_0, P)$ which contains infinitely many elements has a suboptimality gap $\delta > 0$, i.e., the minimum difference between the right hand sides of the average cost optimality equation under the optimal action and a suboptimal action, for the information state which is the stationary distribution for $G$, i.e., $\tau_k = \infty$ for $k \in M(G)$. (ii) Every $G \in \mathcal{G}(\tau_0, P)$ which contains only one element has a unique optimal action.

For a set $A \subset \Psi$ let $A(\epsilon)$ be the $\epsilon$ extension of that set, i.e., $A(\epsilon) = \{\psi \in \Psi : \psi \in A$ or $d_1(\psi, A) < \epsilon\}$, where $d_1(\psi, A)$ is the minimum $\ell_1$ distance between $\psi$ and any element of $A$.

Lemma 4: Let $\tau_0$ be the minimum mixing time such that Assumption 2 holds. Let $L$ be the total number of groups under $\tau_0$. Reindex the groups so we have $G_1, \ldots, G_L$. Define $J_l$ to be the $\epsilon'$ extension of the convex hull of the group $G_l$. Then $\exists \epsilon' > 0$ such that for all $\psi \in J_l$ a unique action is optimal.

Proof: Let $\epsilon_{\max}(\tau_0)$ be the maximum $\ell_1$ distance between any two elements of any group $G \in \mathcal{G}(\tau_0, P)$. We can find a $\tau_0$ such that $\epsilon_{\max}(\tau_0)$ is small enough so that, by the continuity of the function $h_P$, the suboptimality gap for any element of any group that contains infinitely many elements is greater than $\delta'$, where $0 < \delta' < \delta$. Similarly, by Assumption 2 and using the continuity of $h_P$, for any group $G \in \mathcal{G}(\tau_0, P)$ we can find an $\epsilon' > 0$ such that any belief $\psi$ contained in the $\epsilon'$ extension of the convex hull of the points in $G$ has a suboptimality gap $\delta''$ such that $0 < \delta'' < \delta$, and these $\epsilon'$ extensions will not intersect each other. ■

V. An Upper Bound for Regret 

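For any admissible policy $\gamma$, the regret with respect to the optimal $N$ horizon policy is given by

$R^{\gamma}_N(\psi_0; P) = \sup_{\gamma' \in \Gamma} E^P_{\psi_0, \gamma'}\left[\sum_{t=1}^{N} r^{\gamma'}(t)\right] - E^P_{\psi_0, \gamma}\left[\sum_{t=1}^{N} r^{\gamma}(t)\right]$.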



First we will derive the regret with respect to the optimal policy as a function of the number of suboptimal plays. Before proceeding we will define expressions to compactly represent the right hand side of the AROE. Let

$\mathcal{L}(\psi, u, h) = r(\psi, u) + \langle V(\psi, \cdot, u), h \rangle$,

$\mathcal{L}^*(\psi, P) = \max_{u \in U} \mathcal{L}(\psi, u, h_P)$,

$\Delta(\psi, u; P) = \mathcal{L}^*(\psi, P) - \mathcal{L}(\psi, u, h_P)$   (5)

denote the degree of suboptimality of action $u$ at information state $\psi$ when the set of transition probability matrices is $P$. From Proposition 1 of [14] we have for all $\gamma \in \Gamma$

$R^{\gamma}_N(\psi_0; P) = \sum_{t=0}^{N-1} E^P_{\psi_0, \gamma}[\Delta(\psi_t, U_t; P)]$.   (6)

We assume that initially all the arms are sampled once, thus the initial belief is $\psi_0 = \psi_P((s^0, \tau^0))$. Let $\bar{\epsilon}$ be the supremum over the $\epsilon'$ such that Lemma 4 holds. Let $J_1, \ldots, J_L$ be the sets in Lemma 4 formed by $\bar{\epsilon}$. Thus at any time $t$, the belief $\psi_t$ will be in one of the sets $J_1, \ldots, J_L$. Let

$\bar{\Delta}(J_l, u; P) = \sup_{\psi \in J_l} \Delta(\psi, u; P)$.

Note that if $U_t \in O(\psi_t; P)$ then $\Delta(\psi_t, U_t; P) = 0$, else if $U_t \notin O(\psi_t; P)$ then $\Delta(\psi_t, U_t; P) \le \bar{\Delta}(J_l, U_t; P)$ w.p.1. Thus we have

$R^{\gamma}_N(\psi_0; P) \le E^P_{\psi_0, \gamma}\left[\sum_{t=0}^{N-1} \sum_{l=1}^{L} \sum_{u \notin O(J_l; P)} I(\psi_t \in J_l, U_t = u) \bar{\Delta}(J_l, u; P)\right]$

$= \sum_{l=1}^{L} \sum_{u \notin O(J_l; P)} E^P_{\psi_0, \gamma}\left[\sum_{t=0}^{N-1} I(\psi_t \in J_l, U_t = u)\right] \bar{\Delta}(J_l, u; P)$

$= \sum_{l=1}^{L} \sum_{u \notin O(J_l; P)} E^P_{\psi_0, \gamma}[T_N(J_l, u)] \bar{\Delta}(J_l, u; P)$.   (7)

Then we will upper bound $T_N(J_l, u)$ for suboptimal actions by a sum of expressions which we will upper bound individually. Let

$D_{1,1}(N, \epsilon, J_l, u) = \sum_{t=0}^{N-1} I(\hat{\psi}_t \in J_l, U_t = u, E_t, F_t, \mathcal{I}_t(\hat{\psi}_t, u) \ge \mathcal{L}^*(\hat{\psi}_t, P) - 2\epsilon)$,

$D_{1,2}(N, \epsilon, J_l, u) = \sum_{t=0}^{N-1} I(\hat{\psi}_t \in J_l, U_t = u, E_t, F_t, \mathcal{I}_t(\hat{\psi}_t, u) < \mathcal{L}^*(\hat{\psi}_t, P) - 2\epsilon)$,

$D_{1,3}(N, \epsilon) = \sum_{t=0}^{N-1} I(E_t, F_t^c)$,

$D_1(N, \epsilon, J_l, u) = D_{1,1}(N, \epsilon, J_l, u) + D_{1,2}(N, \epsilon, J_l, u) + D_{1,3}(N, \epsilon)$,

$D_{2,1}(N, \epsilon) = \sum_{t=0}^{N-1} I(\|\psi_t - \hat{\psi}_t\|_1 > \epsilon, E_t)$,

$D_{2,2}(N, \epsilon, J_l) = \sum_{t=0}^{N-1} I(\|\psi_t - Bd(J_l)\|_1 < \epsilon, \|\psi_t - \hat{\psi}_t\|_1 \le \epsilon, \psi_t \in J_l, E_t)$,

$D_2(N, \epsilon, J_l) = D_{2,1}(N, \epsilon) + D_{2,2}(N, \epsilon, J_l)$,

where $Bd(J_l)$ is the boundary of $J_l$.

Lemma 5: For any $P$ satisfying Assumption 2,

$E^P_{\psi_0, \gamma}[T_N(J_l, u)] \le E^P_{\psi_0, \gamma}[D_1(N, \epsilon, J_l, u)] + E^P_{\psi_0, \gamma}[D_2(N, \epsilon, J_l)] + E^P_{\psi_0, \gamma}\left[\sum_{t=0}^{N-1} I(E_t^c)\right]$.   (8)

Proof:

$T_N(J_l, u) = \sum_{t=0}^{N-1} \left( I(\psi_t \in J_l, U_t = u, E_t) + I(\psi_t \in J_l, U_t = u, E_t^c) \right)$

$\le \sum_{t=0}^{N-1} I(\psi_t \in J_l, \hat{\psi}_t \in J_l, U_t = u, E_t) + \sum_{t=0}^{N-1} I(\psi_t \in J_l, \hat{\psi}_t \notin J_l, U_t = u, E_t) + \sum_{t=0}^{N-1} I(E_t^c)$

$\le \sum_{t=0}^{N-1} I(\hat{\psi}_t \in J_l, U_t = u, E_t) + \sum_{t=0}^{N-1} I(\psi_t \in J_l, \hat{\psi}_t \notin J_l, E_t) + \sum_{t=0}^{N-1} I(E_t^c)$

$\le D_{1,1}(N, \epsilon, J_l, u) + D_{1,2}(N, \epsilon, J_l, u) + D_{1,3}(N, \epsilon) + D_{2,1}(N, \epsilon) + D_{2,2}(N, \epsilon, J_l) + \sum_{t=0}^{N-1} I(E_t^c)$.

The result follows from taking the expectation of both sides. ■

VI. An Adaptive Learning Algorithm (ALA)

The Adaptive Learning Algorithm (ALA) given in Figure 1 consists of exploration and exploitation phases. The exploration serves the purpose of accurately estimating the transition probabilities. To accurately estimate the transition probability vectors from each state $i \in S^k$ of each arm $k$, we need to take at least a logarithmic number of samples. In order to do this we need to first observe state $i$ of arm $k$, then observe the next state so we can update the estimated transition probabilities $\hat{p}^k_{ij}$, $j \in S^k$. However, we need the estimates to form a probability distribution. Thus instead of the estimates $\hat{p}^k_{ij}$, we use the normalized estimates $\bar{p}^k_{ij}$. If all the states of all the arms are logarithmically sampled in the way described above, then ALA will be in the exploitation phase. If ALA is in the exploitation phase at time $t$, first it computes $\hat{\psi}_t$, the estimated belief at time $t$, using the set of estimated transition probability



Adaptive Learning Algorithm
Initialize: set $a > 0$, $t = 0$, $N_0(u) = 0$, $N^k(i, j) = 0$, $C^k(i) = 0$, $\forall k \in M$, $i, j \in S^k$. Then play each arm once so the initial information state can be represented as an element of the countable form $(s, \tau)_0$.
while $t \ge 0$ do
    $\hat{p}^k_{ij} = \frac{I(N^k(i,j) = 0) + N^k(i,j)}{|S^k| I(C^k(i) = 0) + C^k(i)}$
    $W = \{(k, i), k \in M, i \in S^k : C^k(i) < a \log t\}$.
    if $W \neq \emptyset$ then
        EXPLORE
        if $u(t-1) \in W$ then
            $u(t) = u(t-1)$
        else
            select $u(t) \in W$ arbitrarily
        end if
    else
        EXPLOIT
        Let $\hat{\psi}_t = \mathcal{F}_{\hat{P}^t}(s^0, \tau^0)\, \mathcal{G}(u_0, y_1, \ldots, u_{t-1}, y_t)$ be the estimate of the information state at time $t$ based on the transition probability estimates $\hat{P}^t$ and history up to time $t$.
        solve $g_t + h_t(\psi) = \max_{u \in U} \{ r(\psi, u) + \sum_{y \in S^u} V(\psi, y, u) h_t(T_{\hat{P}^t}(\psi, y, u)) \}$, $\forall \psi \in \Psi$.
        compute the indices of all actions in the current information state:
        $\forall u \in U$, $\mathcal{I}_t(\hat{\psi}_t, u) = \sup_{\tilde{P} \in \Xi} \{ r(\hat{\psi}_t, u) + \sum_{y \in S^u} V(\hat{\psi}_t, y, u) h_t(T_{\tilde{P}}(\hat{\psi}_t, y, u)) \}$ such that $\|\hat{P}^t - \tilde{P}\|_1 \le \sqrt{2 \log t / N_t(u)}$.
        Let $u^*$ be the arm with the highest index (arbitrarily select one if there is more than one arm with the highest index).
        $u(t) = u^*$.
    end if
    $N_{t+1}(u(t)) = N_t(u(t)) + 1$
    if $u(t-1) = u(t)$ then
        for $i, j \in S^{u(t)}$ do
            if state $j$ is observed at $t$, state $i$ is observed at $t - 1$ then
                $N^{u(t)}(i, j) = N^{u(t)}(i, j) + 1$, $C^{u(t)}(i) = C^{u(t)}(i) + 1$.
            end if
        end for
    end if
    $t := t + 1$
end while

Fig. 1. Pseudocode for the Adaptive Learning Algorithm (ALA)



matrices $\hat{P}^t$. Then, it solves the average reward optimality equation using $\hat{P}^t$, for which the solution is given by $g_t$ and $h_t$. We assume that the user can compute the solution at every time step, independent of the complexity of the problem. This solution is used to compute the indices $\mathcal{I}_t(\hat{\psi}_t, u)$ for each action $u \in U$ at the estimated belief $\hat{\psi}_t$. $\mathcal{I}_t(\hat{\psi}_t, u)$ represents the advantage of choosing action $u$ starting from information state $\hat{\psi}_t$, i.e., the sum of gain and bias, inflated by the uncertainty about the transition probability estimates based on the number of times action $u$ is chosen. After computing the indices for each action, ALA selects the action with the highest index. In case of a tie, ALA arbitrarily selects one of the actions with the highest index. Note that it is possible to update the state transition probabilities even in the exploitation phase, given that the arms selected at times $t - 1$ and $t$ are the same. Thus even though the worst-case exploration rate is logarithmic, in general the number of explorations needed may be less than that. In the next section we will denote the policy corresponding to ALA by $\gamma^A$.
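
The bookkeeping part of ALA, that is, the count-based transition estimates, the set $W$ of under-sampled (arm, state) pairs and the confidence radius used in the index, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the row normalization is one reading of the "normalized estimates" mentioned in the text, and treating $a \log t$ as zero for $t \le 1$ is an assumption made to avoid an undefined logarithm. Solving the AROE and computing the index supremum are not attempted here.

```python
import math
import numpy as np

def normalized_estimates(N, C):
    """Row-normalized transition estimates for one arm from its counts.

    N[i, j]: number of observed i -> j transitions in consecutive plays.
    C[i]   : number of times state i was the first state of such a pair.
    """
    S = N.shape[0]
    # Count-based estimate from the pseudocode: (I(N=0) + N) / (|S| I(C=0) + C).
    raw = ((N == 0) + N) / (S * (C == 0) + C)[:, None]
    # Normalize each row so that it is a probability distribution.
    return raw / raw.sum(axis=1, keepdims=True)

def exploration_set(C_list, a, t):
    """Pairs (arm, state) observed fewer than a*log(t) times (empty means exploit)."""
    threshold = a * math.log(t) if t > 1 else 0.0   # assumption for t <= 1
    return [(k, i) for k, C in enumerate(C_list)
            for i in range(len(C)) if C[i] < threshold]

def confidence_radius(t, n_u):
    """Radius sqrt(2 log t / N_t(u)) of the set searched when computing the index."""
    return math.sqrt(2.0 * math.log(t) / n_u)
```

With these helpers, one step of the loop would first call `exploration_set`; only when it returns an empty list does the algorithm move on to the belief estimate and the index computation.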

VII. Analysis of the Regret of ALA 

In this section we will show that when $a$ is sufficiently large, i.e., $a > C(P)$, where $C(P)$ is a constant that depends on $P$, the regret due to explorations is logarithmic in time, while the regret due to all other terms is finite, independent of $t$. Note that since the user does not know $P$, he cannot know how large he should choose $a$. For simplicity we assume that the user starts with an $a$ that is large enough without knowing $C(P)$. However, the user can choose $a = a(t)$, a positive increasing function of time such that $\lim_{t \to \infty} a(t) = \infty$, which will guarantee that after some $t_0$, $a(t) > C(P)$ for $t > t_0$. In this case it can be shown that the regret at time $N$ is of the order of $a(N) \log N$.

Let $E_t$ be the event that ALA exploits at time $t$, $F_t = \{\|h_P - h_t\|_\infty \le \epsilon\}$ and $C^k_t(i)$ be the number of times state $i$ of arm $k$ is observed as the first state in two consecutive plays of arm $k$ up to time $t$. The following lemma will be frequently used in the proofs.
Lemma 6: 

1 



^(IpL 



Pfe,: 



> e 



Et)< 



(t+l)2' 

for all t, ik,jk e Sk, k £ M, for a > Cp{e). 

Proof: By using a Chernoff-Hoeffding bound. 

A. Bounding the Expected Number of Explorations

Lemma 7:

$E^P_{\psi_0, \gamma^A}\left[\sum_{t=0}^{N-1} I(E_t^c)\right] \le \left(\sum_{k=1}^{m} |S^k|\right) a \log N (1 + T_{\max})$,   (9)

where $T_{\max} = \max_{k \in M, i, j \in S^k} E[T^k_{ij}] + 1$, and $T^k_{ij}$ is the hitting time of state $j$ of arm $k$ starting from state $i$ of arm $k$. Since all arms are ergodic, $E[T^k_{ij}]$ is finite for all $k, i, j$.
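Proof:

$\sum_{t=0}^{N-1} I(E_t^c) \le \sum_{k=1}^{m} \sum_{i \in S^k} \sum_{t=0}^{N-1} I(C^k_t(i) < a \log t)$

$= \sum_{k=1}^{m} \sum_{i \in S^k} \sum_{t=0}^{N-1} I(C^k_t(i) < a \log t, C^k_{t+1}(i) \neq C^k_t(i)) + \sum_{k=1}^{m} \sum_{i \in S^k} \sum_{t=0}^{N-1} I(C^k_t(i) < a \log t, C^k_{t+1}(i) = C^k_t(i))$.

Taking expectation,

$E^P_{\psi_0, \gamma^A}\left[\sum_{t=0}^{N-1} I(E_t^c)\right] \le \sum_{k=1}^{m} \sum_{i \in S^k} (a \log N + a \log N \, T_{\max})$.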



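B. Bounding $E^P_{\psi_0, \gamma^A}[D_2(N, \epsilon, J_l)]$

Lemma 8: For $a > C_P(\epsilon / (m S^2_{\max} |S^1| \ldots |S^m| C_1(P, \tau)))$ we have

$E^P_{\psi_0, \gamma^A}[D_{2,1}(N, \epsilon)] \le 2 m S^2_{\max} \beta$,   (10)

where $C_1(P) = \max_{\tau} C_1(P, \tau)$, $C_1(P, \tau) = \max_{k \in M} C_1(P^k, \tau_k)$ and $C_1(P^k, \tau_k)$ is given in Lemma 2.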

Proof: 

\ilpt)x - ilpt)x\ -■ 



Next we will bound $E^P_{\psi_0, \gamma^A}[D_{2,2}(N, \epsilon, J_l)]$.

Lemma 9: Let $\tau_0$ be such that Assumption 2 holds. Then for $\epsilon < \bar{\epsilon}/2$, $E^P_{\psi_0, \gamma^A}[D_{2,2}(N, \epsilon, J_l)] = 0$, $l = 1, \ldots, L$.

Proof: By Lemma 4 any $\psi_t \in J_l$ is at least $\bar{\epsilon}$ away from the boundary of $J_l$. Thus given that $\hat{\psi}_t$ is at most $\epsilon$ away from $\psi_t$, it is at least $\bar{\epsilon}/2$ away from the boundary of $J_l$. ■

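C. Bounding $E^P_{\psi_0, \gamma^A}[D_1(N, \epsilon, J_l, u)]$

First we will upper bound $E^P_{\psi_0, \gamma^A}[D_{1,1}(N, \epsilon, J_l, u)]$. Let $\Xi_k$ be the set of $|S^k| \times |S^k|$ stochastic matrices, $\Xi = (\Xi_1, \ldots, \Xi_m)$. Define the following function:

$\mathrm{MakeOpt}(\psi, u; P, \epsilon) := \{\tilde{P} = (\tilde{P}_1, \ldots, \tilde{P}_m) : \tilde{P}_k \in \Xi_k, \mathcal{L}(\psi, u, h_P(T_{\tilde{P}}(\cdot))) \ge \mathcal{L}^*(\psi, P) - \epsilon\}$,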



fe^e'^ 



?n 
ra 

< E||(^^)"<-^"< 

m 
< C,{P,T)Y,\PI-Py 



kJ Xk 



$J_{\psi,u}(\hat{P}; P, \epsilon) := \inf\{\|\hat{P} - \tilde{P}\|_1^2 : \tilde{P} \in \mathrm{MakeOpt}(\psi, u; P, \epsilon)\}$.



fc=i 



(11) 



where last inequality follows from Lemma |2] By (fTTT i 



V't - V't 



^<|5i|...|5™|Ci(P,x)E||n*-A 



/c=l 



By the definition of MakeOpt, for every action u in every 
information state ip there exists a new stochastic matrix group 
such that action u becomes optimal in ip under hp. 

Lemma 10: $J_{\psi,u}(\hat{P}; P, \epsilon)$ is continuous in its first argument. Therefore there exists a function $f_{P,\epsilon}$ such that $f_{P,\epsilon}(\delta) > 0$ for $\delta > 0$ and $\lim_{\delta \to 0} f_{P,\epsilon}(\delta) = 0$.

Lemma 11: Let (5 > be such that and 5 < 
J^;„(F;P,3e)/2,M ^ O(^;P),0 £ * and a > 
CpifpMS)/imSl,J. Then 

^^o.7-[^i^i(^'^'^''")] ^ (2m5max + 4/5)/3 (12) 

Proof: Since any action can be made optimal at any 
information state the event {Xt{ilJt,u) > C*{ijJt,P) ~ 2e} is 
equivalent to 



Thus we have 



3PeE 



pflUt-Vt >e,Et) 
VII 1 / 

/ rn 

^ P(T.\\Pk-Pk\\^> e/{\Si\ . . . 15„|Ci(P, r)), Et 



Pt-P 



2 21ogt 

1 - Nt{u) 



• iriA,u) 



+ {ViiJt, ., u) . htiTpiijt, ., u))) > C*{i^t,P) - 26) (13) 
On the event Ft we have 



< 






^fc-^fc 



fc=i 



>e/(m|5i|...|5™|Ci(P,T)),Pt 



< 



E E P {\phk3k - Pk,^uJ > 

! i?, 

(m5^,J5i|...|5™|Ci(P,T))' 
< 2mAjjj3^ +1)2' 
where last inequality follows from Lemma |6l Then, 



E ^^'it,y,u){ht{Tp(ipt,y,u)) -h*p{Tp{-4!t,y,u)) 
«/es„ 

yueU,PeE. 

Thus ( fT3] l implies 



<e, 



(14) 



3PeS 



Pt-P 



2 2iogr 

1 Nt(u)/ 



TV-l 



i?^„,,A[^2,l(^,e)] := E^V'0,74 



t=0 



V^t - V't 



>e,-BO 



(y(V't, ., u) . /i^(Tp(^t, ., u))) > C*ii'u P) - 3e) (15) 

From the definition of J^^u{P', P, f) (O implies 

21ogi 



J^,„(Pt;P,3e)< 



< 2m^„^^/3 



Nt{u)' 



Thus we have 

i?i,i(iV,e,Ji,w) 



Af-1 



21ogi 



< Y. I Ut e Ji,Ut = u, J0,„(A; P, 3e) < j;^,Et 



c <^VPeS, 



P~P' 



< 



l2\ogt 



N-l 



<Y^ l(iHeJi,Ut^u,Et 



t=0 



^^..(^;^,3.)<^ + . 



Nt(u) 



(16) 



JV-l 



+ 51 l{il^t€JiMt^u,Et, 



t=o 



'^0..J^;^.3^)>'^^..«(^*;^'3e) + jj(i7) 

Note that ( fTSI l is less than or equal to 

1^. (18) 



By Lemma [To] J^-^„(P; P, 3e) > J^^JPt;P, 3e) + (5 implies 
P* — P\\ > /p,3e((5). Thus (ril us upper bounded by 



, 1 V ^*(") 

(l^(^t ,-,■"*)• Zip (Pp (V't ,•,■"*)) ) 

< {Vii^t, ., li*) • h*piTp{i^t, ., w*))) - e} (20) 

But since $h^*_P$ is continuous there exists $\delta_1 > 0$ such that the
event in (20) implies



Tpiipt,y,u*) ~Tp{-ipt,y,u*) > Siyy £ S'm 



Again since $T_P(\psi, y, u)$ is continuous in $P$, there exists $\delta_2 > 0$
such that the above equation implies



P~P 



> S2 



(21) 



11 

JV-l 
t=0 



P* -F 



>.fpMS)^Et 



Thus 

{lSuU*)<C*{i^t;P)-2e} 

c IvPeS, 



P- P' 



< 



21ogi 



P-P 



>S2 



c 



Taking expectation we have 



{||^*-i.>^4 



Therefore 



pt _p 



>fpMS),Et 



E^ 

AT-l m, 

^EE E p(\pl 

t=o k=i(i^,j^)eSkxSk 

JV-l 



JV-l 



E^,^^A[Di,,iN,e,Ji,u)]<J2P 



ikjk Pk,ikjk 



> 






t=0 



pt _P 



> S2,Et 



< "^^max E W^) 



(19) 



^^ (t+l)2 

Combining (fTSI ) and ( fT9] l we have 



Lemma 12: For a large enough we have 

Proof: If suboptimal action u is chosen at information 
state if^t this means that for the optimal action u* G 0{ipt, P) 

This implies 



W-l m , 5. 

^ E E E p[ \^.k,k ' p^.k.k I > ^^^E, 

t=0 fc=l(j^jfc)GSfcXSfc ^ 
W-1 ^ 

for a > Cp{62/{mSl,^)). ■ 

Let $P_{P,\gamma}$ be the Markov transition kernel induced on $\Psi_C(P)$ by policy $\gamma \in \Gamma$. Let

$\Gamma'(P) = \{\gamma \in \Gamma : P_{P,\gamma}$ is a uniformly ergodic transition kernel$\}$.   (22)



For 7 G F'(P) let Vp^ be the stochastic kernel of 
the stationary distribution whose each row is the stationary 



vpgs, 



p- p* 



< 



1 2 log t 

Ntiu) 



distribution tt 



iV{^t,;U*).ht{TpitPt,;U*))) 

< iV{tlJt,.,u*)»h*p{Tp{i't,.,u*)))-2e 
Since on Ft (O holds, 

{lSt,u*)<C*iil^t;P)~2e} 



Assumption 3: There exists $\epsilon > 0$ such that any optimal policy for the infinite horizon average cost MAB problem with transition matrices $\tilde{P}$ such that $\|P - \tilde{P}\|_1 < \epsilon$ belongs to $\Gamma'(P) \cap \Gamma'(\tilde{P})$.

Since $h_P$, $h_t$ are unique up to a constant we can set them equal to the bias of the optimal policies $\gamma(P)$ and $\gamma(\hat{P}^t)$ under $P$ and $\hat{P}^t$ respectively. Let $h_{P,\gamma}$ be the bias under policy $\gamma$ and transition matrices $P$. Then,



VIII. Conclusion 



hp,jW = £i?^^„,^[r(X,,t/,)-ffP,^] 



(23) 



Let rp^^ = (r(V',7(V')))v6*c(P)- We have gp^^ 
VpSp^-y- Using this we can write hp^^ as 

N 

^Pn = ^'Pk^rp.j - Ngp^^ 
t=i 

t=N+l 



Lemma 13: There exists i; > such that if 



Pk - Pk 



(24) 



< 



hp-y — h i 



< 



?, Vfc e M then 
r'(P). 

Lemma 14: There exists <; > such that if 

<;,Vfc e M then ||/ip - /ipll < e. 

Lemma 15: Let <j > be such that Lemma [141 holds. Then 
for a > Cp{<;/S^^^) we have 



< e, for any 7 G T'{P) D 
Pk - Pk 



Proof: We have by Lemma [T4l 



(25) 



Pk-Pt 



Thus, 



{ 
Then 



Pk^Pl 



<c:,VfceM} C {||/ip-/i*|| <e}. 



> <;, for some k G M} D {||/ip - /I'll > e|. 



<,^-4i?i,3(^,e)=i?^„.^^ 



■AT-l 



E^(^*'^*^) 



t=0 



Af-1 

<E^( 

t=0 



Pk - Pk 



1 

N-1 



> <;, for some k e Af , i?t) 



^E E E^ii^- 

fc=i(ifc,ifc)es'fcxSfc t=o 



ikjk Pk,ikjk 



> 



02 
^max 



.Et 



D. Logarithmic regret upper bound

Theorem 1: Under Assumptions 1, 2 and 3, for $a$ sufficiently large, $a > C(P)$, for any suboptimal action $u \in U$

$E^P_{\psi_0, \gamma^A}[T_N(J_l, u)] \le a \log N (1 + T_{\max}) + (8 m S^2_{\max} + 4/\delta) \beta$.

Thus

$R^{\gamma^A}_N(\psi_0; P) \le \left( a \log N (1 + T_{\max}) + (8 m S^2_{\max} + 4/\delta) \beta \right) \sum_{l=1}^{L} \sum_{u \notin O(J_l; P)} \bar{\Delta}(J_l, u; P)$.

Proof: The result follows from Lemmas 5, 7, 8, 9, 11, 12 and 15. ■



In this paper we proved that, given that the transition probabilities
of the arms are positive for any state and under some assumptions
on the structure of the optimal policy for the infinite
horizon average reward problem, there exist index policies
which give logarithmic regret with respect to the optimal
finite horizon policy uniformly in time. Our future research
includes finding the conditions on $P$ such that Assumptions
2 and 3 hold.